SUMMARY


Fundamental to the development of redundant software techniques (known as 
fault-tolerant software) is an understanding of the impact of multiple joint 
occurrences of errors, referred to here as coincident errors. A theoretical 
basis for the study of redundant software is developed which (1) provides a 
probabilistic framework for empirically evaluating the effectiveness of the 
general (N-Version) strategy when component versions are subject to coincident 
errors, and (2) permits an analytical study of the effects of these errors. 

The basic assumptions of the model are: (i) independently designed software 
components are chosen in a random sample and (ii) in the user environment, the 
system is required to execute on a stationary input series. An intensity 
function, called the intensity of coincident errors, has a central role in the 
model. This function describes the propensity of a population of programmers 
to introduce design faults in such a way that software components fail together 
when executing in the user environment. The model is used to give conditions 
under which an N-Version system is a better strategy for reducing system 
failure probability than relying on a single version of software. In addition, 
a condition which limits the effectiveness of a fault-tolerant strategy is 
studied, and we ask whether system failure probability varies monotonically 
with increasing N or whether an optimal choice of N exists.

1.0 INTRODUCTION


The use of independently designed, redundant software is an intuitively 
appealing approach to increasing software reliability. The redundancy 
principle, after all, has long been accepted as an effective means for 
improving the reliability of hardware devices. The basic premise in both cases 
is that components (either software or hardware) will have independent failure 
characteristics so that the probability of failures occurring simultaneously is 
small (ideally the product of the individual component failure probabilities). 
Fault-tolerant software is the methodology for structuring software components 
to cope with residual software design faults. The most widely known, N-Version 
programming [1] and recovery block [2], are analogous to the hardware 
techniques of N-Modular redundancy and stand-by sparing, respectively. 

Although redundancy has been successfully applied to fault-tolerant 
computer systems (e.g., [3], [4]), its application to software has been slow to 
develop. One reason for this may be that little empirical data is available 
that demonstrates an increase in reliability sufficient to justify the
increased cost of the software development, although it has been suggested that 
fault-tolerant software is cost effective [5]. 

More important, however, is the reliability degradation of fault-tolerant
software structures caused by either: (1) multiple faults which produce
dissimilar outputs but are manifested by the same input conditions, or (2)
related software design faults causing identical incorrect outputs. The
general notion of related software design faults is often referred to as
"correlated" faults. This term, however, appears to have different meanings to
different authors, and it is sometimes not clear which combination of the above
fault types, and what degree of the attribute "related", is being discussed. We
will refer, collectively, to errors manifested by both of the above fault types 
as coincident errors. What will distinguish correlated errors from those that 
occur simultaneously by chance, we presume, will be the intensity of coincident 
errors as discussed in more detail later. In an extreme case one might imagine 
that all residual design faults are common to all versions of a redundant 
software structure and thus there is no reliability gain over randomly 
selecting a single version of the software. More typical might be a situation 
where a majority of identical faulty modules, in a voting scenario, outvote the 
correct versions which are in the minority. 

Although it is true that detected failures are potentially less serious 
than undetected failures since control, in the case of detected failures, can 
be passed to a higher authority, both are, in fact, failures of the fault- 
tolerant structure. For applications in which fault-tolerant software is 
performing some critical function, we take the conservative position that any 
higher authority could not adequately cope with this loss of critical function 
and that there is no safe-down state to repair the software (more likely reset 
to some initial state). Thus we are concerned with both types of errors, which 
are described by coincident errors.

Given that coincident errors are potentially devastating to redundant
software systems, it is fundamental to understand and assess the effects of 
these errors, both analytically and empirically, on the general strategy of 
software redundancy. Hardware designers, to date, have not been concerned with 
this issue. The assumption is that hardware components do not share common 
design faults but rather it is their independent degradation processes which 
mainly contribute to unreliability. Independence, then, justifies the use of
combinatorial methods for estimating hardware reliability. In the independence 
case, conditions for which redundancy is a better strategy for reducing failure 
probability than the use of a single component are well known [6]. 

In the case of redundant software, it is suggested in [7] that the 
independence model when applied to software components leads to poor 
predictions of reliability. Further, the analysis given in this paper shows 
that for cases of coincident errors which appear reasonable to expect in 
applications, the independence model gives estimates which fail to be 
conservative. 

Because statistically independent failure among software components is a
questionable assumption, the model suggested in [8] includes a
"correlation" factor. However, it too assumes a form of higher order
independence by representing the probabilities of joint occurrence of
identical, incorrect output in terms of the probability of pairwise occurrence
of such events. Furthermore, since the probability of identical, incorrect
output among component versions will likely vary with the input, the idea that
all of this complexity can be captured in a single scalar correlation
coefficient is questionable. For this reason, we employ an intensity function
defined on the input space (similar in several ways to a parameter vector)
which permits variation in the probability that software components fail 
together. We shall not attempt to evaluate one fault-tolerant technique over 
another but rather we shall examine the principle of redundant software as 
represented by multiple (i.e., N) versions which are independently developed to 
a common set of requirements and then operationally subjected to a perfect 
majority voter. 

We submit that there are a number of questions which must be answered in 
order to provide a basic understanding of the effects of coincident errors on 
redundant software. The framework discussed in the present paper does not 
require unnecessary assumptions concerning independent failure of software 
components; rather a model is derived from assumptions concerning the process 
of selecting independently designed software components and testing them on an 
input series chosen to emulate the user environment. In other words, we 
believe the model has sufficient generality to warrant conclusions concerning 
questions of the following type: 

(1) Is an N-Version software structure always more effective at reducing 
failure probability than a single version of software? If not, what are 
the conditions which cause this? 

(2) What are the effects of different intensities of coincident errors on a
general N-Version system?

(3) What are the effects of increasing N? Does the failure probability always
increase or decrease with increasing N, as for the independence model used
for hardware, or might there exist an optimal choice of N other than N = 1 or
N = ∞? Is there a limit on the effectiveness of fault-tolerance at reducing
the probability of failure?

(4) Does the independence model give a valid estimate of the failure
probability of a redundant N-Version system?

(5) Under what conditions does the assumption of independence hold? 

In order to give a framework for evaluating the effectiveness of a fault- 
tolerant strategy and, in particular, to answer the above questions, we propose 
a model based on formalizing the notion of coincident errors. The basic 
assumptions of this model are: (i) that independently designed software 
components are chosen in a random sample and (ii) each component and each 
system is required to execute on a stationary, independent input series. We 
derive the failure probability of a redundant N-Version system and establish 
general conditions giving answers to (1), (3), and (5) above. The main
quantities describing the model are: an intensity function defined on the
input space which models the occurrence of coincident errors and a usage
distribution which gives the probabilities of inputs occurring in various 
subsets. Also important to our description is an intensity distribution 
derived from the intensity function and the usage distribution. The intensity 
distribution completely specifies the failure probability of a redundant 
system; that is, if the intensity distribution is known or can be estimated, 
answers to questions of type (1) - (3) can be given. Since empirical 
information concerning the intensity distribution is unavailable, we study the 
effects of coincident errors by varying the choice of intensity distributions. 





Notation

We follow the usual convention in which random variables are denoted by
capital letters and their realizations are denoted by the corresponding lower
case. We also use the following:

$\Omega$          input set for software components designed to a common
                  specification;

$x$               a variable representing elements of $\Omega$;

$Q$               the usage distribution, a probability measure defined on
                  (measurable) subsets of $\Omega$;

$v(x)$            the score function, a binary function distinguishing the
                  occurrence of correct and incorrect output when a software
                  component executes on $x \in \Omega$;

$\theta(x)$       intensity of coincident errors;

$E(\cdot)$ ($P(\cdot)$)  mathematical expectation (probability) derived from a
                  product probability space as specified by the two-stage
                  process of selecting software components at random and
                  testing them on inputs chosen at random from $\Omega$;

$p_N$             average probability of failure of an N-Version system;

$p$               average probability of failure of a single software component;

$N$               number of software components in a multiple version program;

$n$               number of software components chosen in a random sample;

$G(y)$            intensity distribution induced by the mapping
                  $x \to \theta(x)$ from $\Omega$ into [0, 1];

$G_-(y)$          left continuous version of $G(y)$;

$g(\theta)$       probability mass function for a discrete intensity
                  distribution;

$h(y;N)$          $\sum_{\ell=m}^{N} \binom{N}{\ell} y^{\ell} (1-y)^{N-\ell}$,
                  $0 \le y \le 1$, where $m = (N+1)/2$;

$\sigma^2$        variance of the intensity distribution;

$\phi(y;N)$       $h(y;N) - y$.


2.0 THE MODEL 


Suppose we are told that a particular software component, having input set
$\Omega$, gives incorrect output when executing on inputs in some subset $F$
of $\Omega$ and gives correct output when executing on inputs in the
complementary set $F'$. If all inputs arriving in the user environment belong
to $F$, then the component is totally unreliable, whereas if all inputs arrive
in $F'$, then the component is perfectly reliable. It is clear that some
structure is required of the input process in order to evaluate reliability;
for example, the inputs could alternate between $F$ and $F'$ or they may occur
randomly in $\Omega$.

We assume that an input series $X_1, X_2, \ldots$ is stationary and
independent; that is, successive inputs occur or are chosen at random in a
series of independent trials according to a common distribution. Some software
reliability models [9] and software testing experiments [10], [11] implicitly
assume or suggest this structure. The common distribution, say $Q$, is the
usage distribution which gives the probabilities $Q(A)$ that successive inputs
are chosen at random in subsets $A$ of $\Omega$.



At this stage of the discussion, other than the usage distribution itself, 
the full probabilistic structure of an input series is not needed. Our concern 
is mainly with the probability that a software component, and a redundant 
structure developed from a set of components, fails on successive trials. 

Let $v(x)$, $x \in \Omega$, denote the score function for a particular
component: $v(x) = 1$ ($v(x) = 0$) if the component gives incorrect (correct)
output when executing on $x \in \Omega$. Note that the subset $F$ of $\Omega$
for which the component gives incorrect output is $\{x : v(x) = 1\}$. The
probability $Q(F)$ that the particular component fails on successive trials is

$$Q(F) = \int v(x)\,dQ. \tag{1}$$

Now consider either a physically existing population of programmers who
would design software to a given specification, or a conceptual population
based on what would happen in a large number of repetitions of an experiment
such as one which is designed to study the long term effectiveness of a
fault-tolerant strategy. Let $\theta(x)$ describe the proportion of this
population giving errors in the output when executing on $x \in \Omega$. This
intensity function can be interpreted a number of ways: for example, it models
the occurrence of coincident errors; it gives the probability that a software
component, when chosen at random, fails on a particular input; and it
describes a propensity for software components to fail together when
executing on a single input.

If a component is chosen at random, then for fixed $x \in \Omega$, its score
function $V(x)$ is a binary random variable taking values zero and one with
probabilities $1 - \theta(x)$ and $\theta(x)$ and, therefore, its mathematical
expectation is $E[V(x)] = \theta(x)$ for each $x \in \Omega$.

As previously stated, (1) gives the probability that a particular component
gives an error in its output. This probability, however, is a random variable
which varies over repeated selections of software components. The mean of its
distribution is

$$E\left[\int V(x)\,dQ\right] = \int \theta(x)\,dQ. \tag{2}$$

The conceptual distinction between (1) and (2) is analogous to the process 
for estimating the reliability of hardware devices. That is, they capture, 
respectively, the difference between the reliability of a particular hardware 
device and the reliability of a population of devices of its type. While 
reliability predictions are actually desired for the device on hand, they are 
usually made on the basis of empirical results reported from testing a subset 
of the population. 

Neither the score function nor its expected value has introduced any
assumptions to our model. However, when describing the reliability of a
redundant structure, we need to state what is meant by independently designed
versions of software components. We shall mean a set of $n$ components which
is chosen at random from a population so that: (a) $\{V_1(x);\ x \in \Omega\}$,
$\{V_2(x);\ x \in \Omega\}$, ..., $\{V_n(x);\ x \in \Omega\}$ are independent
collections of random variables and (b) for each $x \in \Omega$, $V_1(x)$,
$V_2(x)$, ..., $V_n(x)$ are identically distributed random variables. This
assumption describes the usual conditions required of a random sample. The
condition that $V_1(x)$, $V_2(x)$, ..., $V_n(x)$ are identically distributed
implies that the probabilities $\int V_i(x)\,dQ$, $i = 1, 2, \ldots, n$, of
incorrect outputs, which are themselves random variables, vary according to a
common distribution and the mean of this distribution is
$\int \theta(x)\,dQ$, as given earlier. Note that condition (a) is similar to
the condition defining independence of a collection of stochastic processes
indexed by a time parameter. It is also similar to the process of recording
independent vector measurements for a sample of individuals taken from a human
population. We emphasize that statistical independence in the current context
refers only to the selection process and does not imply statistically
independent failures among software components. This point is discussed
further in Section 3.0.

Although empirical studies of fault-tolerant software are not likely to
often be conducted in the strict sense defining a random sample, repetitions of
the version selection process do involve uncertainty concerning the subsets
of $\Omega$ on which the component versions fail. The probabilistic structure
implied by the conditions defining a random sample gives a meaningful way to 
interpret experimental results when the main interest lies in the long term 
effectiveness of a fault-tolerant strategy rather than the study of a 
particular instance of its application. 

Now consider an N-Version (N = 1, 3, 5, ...) structure consisting of N 
software components, each designed to a common specification and required to 
execute on a single input series in the user environment. The outputs given 
after each execution are compared and, in case of disagreement, a consensus 
result is obtained by majority vote. An N-Version structure fails when
executing on some subset $F$ of $\Omega$ and, as before, this subset is
conveniently described by a score function $v(x)$, $x \in \Omega$, which is

$$v(x) = \sum_{\ell=m}^{N} \sum v_{i(1)}(x) \cdots v_{i(\ell)}(x)\,[1 - v_{i(\ell+1)}(x)] \cdots [1 - v_{i(N)}(x)] \tag{3}$$

where $v_{i(1)}(x), v_{i(2)}(x), \ldots, v_{i(N)}(x)$ is a permutation of the
score functions for the component versions. The second sum in (3) is over all
distinct subsets of $\{1, 2, \ldots, N\}$ and $m = (N+1)/2$ corresponds to the
case of a redundant system that fails when at least a majority of its
components fail.


We now state the main result of this Section: 

Theorem 1. Under the condition that the component versions are the result of a
random sample and each is required to execute on common inputs chosen at
random, the expected probability of system failure is

$$p_N = \int \sum_{\ell=m}^{N} \binom{N}{\ell} [\theta(x)]^{\ell} [1 - \theta(x)]^{N-\ell}\,dQ. \tag{4}$$

Proof. Upon conditioning on $V_1(\cdot), V_2(\cdot), \ldots, V_N(\cdot)$, the
probability of failure is $\int v(x)\,dQ$ where $v(\cdot)$ is given by (3).
Now taking the expectation inside the integral and using the independence of
$V_1(\cdot), \ldots, V_N(\cdot)$ due to sampling, together with the condition
$E[V_i(x)] = \theta(x)$, $i = 1, 2, \ldots, N$, gives the desired result.
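The two-stage process behind Theorem 1 can be checked numerically. The sketch
below is illustrative only and is not part of the original analysis: it
assumes a hypothetical input set of three points with made-up usage
probabilities `q` and intensities `theta`, and it additionally assumes that a
sampled version fails independently on different inputs, which the model does
not require. It compares the closed form (4) with a Monte Carlo average over
randomly selected sets of versions.

```python
import random
from math import comb

def h(y, n):
    """Probability that at least m = (n + 1) / 2 of n versions fail (n odd)."""
    m = (n + 1) // 2
    return sum(comb(n, l) * y**l * (1 - y)**(n - l) for l in range(m, n + 1))

def p_n_exact(q, theta, n):
    """Equation (4) on a finite input set: sum over x of q(x) * h(theta(x); n)."""
    return sum(qx * h(tx, n) for qx, tx in zip(q, theta))

def p_n_simulated(q, theta, n, samples, rng):
    """Average over random selections of n versions of the usage-weighted
    probability that a majority of the selected versions fail together."""
    m = (n + 1) // 2
    total = 0.0
    for _ in range(samples):
        # Stage 1: draw n score functions; each version fails on input x
        # with probability theta(x), independently of the other versions.
        scores = [[rng.random() < tx for tx in theta] for _ in range(n)]
        # Stage 2: Q-weighted failure probability of this particular system.
        total += sum(qx for i, qx in enumerate(q)
                     if sum(v[i] for v in scores) >= m)
    return total / samples

# Hypothetical values, chosen only for illustration.
q = [0.5, 0.3, 0.2]
theta = [0.0, 0.2, 0.6]
exact = p_n_exact(q, theta, 3)
estimate = p_n_simulated(q, theta, 3, 20000, random.Random(1))
print(exact, estimate)
```

Any single simulated system has its own failure probability; only the average
over repeated selections is expected to approach (4).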





Although the main interest may often lie in the probability, $\int v(x)\,dQ$
(where $v(x)$ is given by (3)), of failure of a particular N-Version system
rather than the population average, $p_N$, as given by (4), the quantity
$\int v(x)\,dQ$ will vary from one application to another and, unless we
replace $v(x)$ by its expected value as done in (4), there is no basis for
further simplification. This same point was mentioned earlier when comparing
(1) and (2) and, as before, is analogous to the difference between the
reliability of a particular hardware device and the average reliability for a
population of devices of its type.

While $\theta(x)$, $x \in \Omega$, together with $N$ and the usage
distribution completely specify $p_N$, little empirical evidence is available
from which to estimate $\theta(x)$, $x \in \Omega$; thus it is unclear which
choices of the intensity are reasonable to expect in applications. For this
reason, we reparameterize $p_N$ in terms of the following:


$$h(y;N) = \sum_{\ell=m}^{N} \binom{N}{\ell} y^{\ell} [1 - y]^{N-\ell}, \quad 0 \le y \le 1 \tag{5}$$

and

$$G(y) = \int_{\{x:\ \theta(x) \le y\}} dQ, \quad -\infty < y < \infty. \tag{6}$$

We shall refer to $G(y)$ as the intensity distribution which is induced by the
mapping $x \to \theta(x)$ from $\Omega$ into [0, 1].

Before proceeding to give a reparameterization of $p_N$, consider the
interpretation of $G(y)$ in the discrete case which arises when $\theta(x)$
takes a finite number of values over subsets of $\Omega$. Suppose
$\theta(x) = \theta_i$ for $x \in A_i$ where $A_1, \ldots, A_r$ is a partition
of $\Omega$ and suppose the sets giving a common value under the mapping have
been combined and indexed so that
$0 \le \theta_1 < \theta_2 < \cdots < \theta_r \le 1$. Then, in this case,

$$G(y) = \sum_{\{i:\ \theta_i \le y\}} q_i, \quad -\infty < y < \infty \tag{7}$$

where $q_i = Q(A_i)$, $i = 1, 2, \ldots, r$, is the probability mass given by
the usage distribution. Since $G(y)$ is right continuous, $G(b) - G(a)$ gives
the probability that inputs are chosen so that the proportion of a population
of components that fail is in the range $(a, b]$, $a < b$ (the upper limit,
$b$, is included in the interval $(a, b]$ but the lower limit, $a$, is not
included).
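For the discrete case, (7) is simple to evaluate directly. The following
sketch is illustrative only: the probability mass function `pmf`, mapping each
intensity value to its mass, is a hypothetical example and not data from this
paper. It shows the right continuous distribution, its left continuous
version, and the mass of a half-open interval $(a, b]$.

```python
def G(pmf, y):
    """Right continuous intensity distribution (7): mass with theta <= y."""
    return sum(mass for theta, mass in pmf.items() if theta <= y)

def G_minus(pmf, y):
    """Left continuous version: mass with theta strictly less than y."""
    return sum(mass for theta, mass in pmf.items() if theta < y)

# Hypothetical discrete intensity distribution {theta_i: q_i}.
pmf = {0.0: 0.90, 0.1: 0.07, 0.3: 0.03}

# Mass of the interval (0, 0.3]: includes theta = 0.3, excludes theta = 0.
mass_in_interval = G(pmf, 0.3) - G(pmf, 0.0)

# The two versions differ exactly by the point mass at a support point.
jump_at_point = G(pmf, 0.1) - G_minus(pmf, 0.1)
print(mass_in_interval, jump_at_point)
```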

For later reference, we restate our earlier result in reparameterized form:

Corollary. Under the conditions stated in the previous theorem,

$$p_N = \int h(y;N)\,dG(y) \tag{8}$$

where $h(y;N)$ is given in (5) and $G(y)$ is given by (6).

The result follows by substitution (e.g., see [12], p. 213).
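When the intensity distribution is discrete, the integral in (8) reduces to a
finite sum over the support points. The sketch below is illustrative only and
assumes a hypothetical probability mass function `pmf`; the function name
`p_n` is ours.

```python
from math import comb

def h(y, n):
    """h(y; N) of (5) for odd n, with m = (n + 1) / 2."""
    m = (n + 1) // 2
    return sum(comb(n, l) * y**l * (1 - y)**(n - l) for l in range(m, n + 1))

def p_n(pmf, n):
    """Equation (8) for a discrete intensity distribution {theta: mass}."""
    return sum(mass * h(theta, n) for theta, mass in pmf.items())

# Hypothetical discrete intensity distribution.
pmf = {0.0: 0.90, 0.1: 0.07, 0.3: 0.03}

for n in (1, 3, 5):
    print(n, p_n(pmf, n))
```

With N = 1 the sum is the mean of the intensity distribution, that is, the
average failure probability p of a single version.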

3.0 INDEPENDENT ERRORS


The assumption that failures occur independently (in a statistical sense) 
in hardware components is a widely used and often successful model for 
predicting the reliability of hardware devices. Thus, it is tempting to assume
that software components also fail independently and, on this basis, estimate
the failure probability of a redundant N-Version system from

$$\sum_{\ell=m}^{N} \binom{N}{\ell} p^{\ell} (1-p)^{N-\ell}. \tag{9}$$

This gives a computationally convenient formula for which the only
information required is the average failure probability $p$ of the software
components. However, it clearly differs from the representation of $p_N$ given
earlier in (4). In this Section we ask whether independence implies a
condition on the intensity distribution which is reasonable to expect in 
applications. Also, we ask whether it is correct to interpret a low intensity 
as implying statistical independence and a high intensity as implying 
statistical dependence in the context of coincident errors. 


Consider for the moment only two versions. Suppose, as before, they are
chosen in a random sample and each is required to execute on common inputs
chosen at random from $\Omega$. The two versions fail independently if

$$P(F_1 \cap F_2) - P(F_1) \cdot P(F_2) = 0. \tag{10}$$

We have

$$P(F_i) = \int \theta(x)\,dQ, \quad i = 1, 2 \tag{11}$$

and

$$P(F_1 \cap F_2) = E\left[\int V_1(x) V_2(x)\,dQ\right] \tag{12}$$

where $V_1(\cdot)$ and $V_2(\cdot)$ are the score functions for the individual
versions. Upon taking the expectation inside the integral in (12) and using
the assumption that $V_1(\cdot)$ and $V_2(\cdot)$ are the result of a random
sample, we have

$$P(F_1 \cap F_2) = \int \theta^2(x)\,dQ. \tag{13}$$

Now the condition for independence, as stated in (10), is that

$$\int \theta^2(x)\,dQ - \int \theta(x)\,dQ \cdot \int \theta(x)\,dQ = 0. \tag{14}$$

However, the term on the left is the variance,

$$\sigma^2 = \int y^2\,dG(y) - \int y\,dG(y) \cdot \int y\,dG(y), \tag{15}$$

of the intensity distribution and

$$\int y\,dG = \int \theta(x)\,dQ \tag{16}$$

is its mean.


The variance of a distribution can equal zero only if the mass of the
distribution is concentrated at a single point. Therefore, we state the
following:

Theorem 2. Under the conditions stated in the previous theorem, a necessary
and sufficient condition for (unconditional) independent failure of the
component versions is that $\theta(x)$ be constant except on a subset $A$ of
$\Omega$ for which $Q(A) = 0$.

Proof. In the general case, independence holds if

$$P\left(\bigcap_{i=1}^{n} F_i\right) = \prod_{i=1}^{n} P(F_i),$$

or if,

$$\int \theta^n(x)\,dQ = \left[\int \theta(x)\,dQ\right]^n.$$

By substitution, a constant intensity implies that $F_1, F_2, \ldots, F_n$ are
independent events. Conversely, independence of $F_1, F_2, \ldots, F_n$
implies pairwise independence which in turn implies a constant intensity as
shown for the case n = 2.
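The role of the variance in (14) and (15) can be seen in a small computation.
The sketch below is illustrative only; both probability mass functions are
hypothetical and were chosen to have the same mean intensity of .05, one
concentrated at a single point and one spread over two points.

```python
def joint_minus_product(pmf):
    """Left side of (14) for a discrete intensity distribution
    {theta: mass}: P(F1 and F2) - P(F1) * P(F2), i.e. the variance (15)."""
    mean = sum(mass * theta for theta, mass in pmf.items())
    second_moment = sum(mass * theta**2 for theta, mass in pmf.items())
    return second_moment - mean * mean

# Hypothetical distributions with the same mean intensity (.05).
constant = {0.05: 1.0}            # every input has intensity .05
varying = {0.0: 0.95, 1.0: 0.05}  # most inputs never fail, a few always do

print(joint_minus_product(constant))  # zero: independent failures
print(joint_minus_product(varying))   # positive: failures are dependent
```

The two cases have the same average component failure probability, so the
size of the average intensity alone does not indicate whether independence
holds.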

A few words of explanation are in order to illustrate the difference
between unconditional probabilities which are used in Theorem 2 and conditional
probabilities that are appropriate when the discussion is limited to particular
versions. This difference was discussed earlier following the statement of
Theorem 1 and also when comparing (1) and (2). Suppose that two particular
independently designed versions fail on inputs chosen from the sets
$F_i = \{x : v_i(x) = 1\}$, $i = 1, 2$. The conditional probability (given the
particular versions) that both versions fail on inputs chosen from $\Omega$ is

$$Q(F_1 \cap F_2) = \int v_1(x) v_2(x)\,dQ$$

and the individual conditional probabilities are

$$Q(F_i) = \int v_i(x)\,dQ, \quad i = 1, 2.$$

If $F_1$ and $F_2$ are disjoint sets and if $Q(F_i) > 0$, $i = 1, 2$, then

$$Q(F_1 \cap F_2) < Q(F_1)\,Q(F_2).$$

Thus the two particular versions represent a case of negative (conditional)
dependence. Further, these two versions may have been chosen from a population
having constant intensity. This does not invalidate the statement of Theorem 2
for the same reason that a coin cannot be declared biased on the basis of
observing two heads in two tosses. Repetitions of the process of selecting
independently designed versions would typically result in conditional
probabilities which vary over repeated selections and it is the average of
these conditional probabilities to which we refer in Theorem 2.

A constant intensity is probably unreasonable to expect in most 
applications. For example, if for some population, none of the component 
versions fail on most inputs while a small percentage fail on a small portion 
of the inputs, then independence cannot hold. 

Now consider whether it is physically plausible that a constant intensity 
should imply the independent occurrence of errors in component versions. This 
same question can arise in the context of a coin tossing experiment. Suppose
that if two similar coins (software components designed to a common
specification) are tossed (execute) under one condition (on input $x_1$) then
the probability of each giving tails is .4, but if each is tossed under another
condition (input $x_2$), the probability of each giving tails is .6. Now if the
condition (input) is chosen at random and the pair of coins is tossed, the
probability of both giving tails is $.5(.4)^2 + .5(.6)^2 = .26$ while the
probability that they individually give tails is $.5(.6) + .5(.4) = .5$.
Independence fails to hold ($.26 \ne (.5)^2$) since the probability of tails
varies with the input conditions. Independence in the software context is,
therefore, no less plausible than for other experiments in which the results
are given by a two-stage process.
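As a worked check of the coin example against (14) and (15), the gap between
the joint probability and the product of the marginal probabilities equals the
variance of the two-point intensity distribution that places mass .5 on each
of the values .4 and .6:

```latex
\begin{align*}
P(\text{both tails}) - [P(\text{tails})]^2 &= .26 - (.5)^2 = .01, \\
\sigma^2 &= .5\,(.4 - .5)^2 + .5\,(.6 - .5)^2 = .01 .
\end{align*}
```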

Even though the notion of a constant intensity might seem unacceptable at
first, we assert that users of the independence model implicitly make this
assumption. Given that information concerning the intensity is unavailable,
the most logical choice would be the average intensity $\int \theta(x)\,dQ$,
which is also the mean component failure probability. Substituting the average
intensity for $\theta(x)$ in (4) gives the independence model.
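The effect of replacing the intensity by its average can be illustrated
numerically. The sketch below is illustrative only: the skewed probability
mass function `pmf` is hypothetical and is not taken from the tables of this
paper. It evaluates the independence model (9) at the mean intensity and
compares it with the coincident-error result (8) for N = 3.

```python
from math import comb

def h(y, n):
    """h(y; N) of (5) for odd n, with m = (n + 1) / 2."""
    m = (n + 1) // 2
    return sum(comb(n, l) * y**l * (1 - y)**(n - l) for l in range(m, n + 1))

# Hypothetical skewed intensity distribution {theta: mass}.
pmf = {0.0: 0.99, 0.2: 0.01}

p = sum(mass * theta for theta, mass in pmf.items())  # mean intensity
independent = h(p, 3)                                  # model (9)
coincident = sum(mass * h(theta, 3) for theta, mass in pmf.items())  # (8)

print(p, independent, coincident)
```

For this assumed distribution the independence model gives a much smaller
failure probability than (8), so it would not be a conservative estimate.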

Our results show it is incorrect to interpret a low intensity as implying
statistical independence and a high intensity as implying statistical
dependence. Rather the variance $\sigma^2$ of the intensity distribution gives
a measure of departure from the independence model. However, a more useful
approach may be to compare directly computations given by (8) and (9). This
difference describes the effect of assuming independence when predicting the
failure probability of an N-Version system. We examine this difference in a
later section.

4.0 A SUFFICIENT CONDITION FOR REDUNDANCY TO IMPROVE RELIABILITY

Whereas estimates of $p_N$, $N = 1, 3, 5, \ldots$, can be given directly on
the basis of a random sample of independently designed versions, such
estimates would provide little insight concerning the effect of coincident
errors. Moreover, in terms of efficiency, rather than examine a series of
parameters to decide whether redundancy improves reliability, it is desirable
to give a global condition which permits examining the intensity distribution.
The difference in failure probabilities for the N-Version and single version
cases is

$$p_N - p = \int [h(y;N) - y]\,dG(y) \tag{17}$$

where $G(y)$ is the intensity distribution and $h(y;N)$ is given in (5). We
desire a condition on $G(y)$ which ensures that (17) is negative. Here and
in later discussion of this problem we refer only to the case $m = (N+1)/2$.

Insight into the type of condition required is gained by examining the
integrand $\phi(y;N) = h(y;N) - y$ appearing in (17). As shown in the Appendix,
$\phi(y;N)$ is an antisymmetric function (a class of functions studied in
[13]), with center of antisymmetry at .5; that is,

$$\phi(.5 + y;N) = -\phi(.5 - y;N), \quad 0 \le y \le .5. \tag{18}$$

In addition, $\phi(y;N)$ is convex over the range $0 \le y \le .5$, concave
over $.5 \le y \le 1$, $\phi(0;N) = \phi(.5;N) = \phi(1;N) = 0$, and
$\phi(y;N)$ lies below (above) the horizontal axis for $0 < y < .5$
($.5 < y < 1$). The antisymmetry of $\phi(y;N)$ suggests that a sufficient
condition for (17) to be negative is that the intensity distribution assign
greater mass to intervals of the type $(.5 - b, .5 - a]$, $0 \le a < b$, than
to their symmetrically located counterparts $[.5 + a, .5 + b)$.
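The stated properties of the integrand can be spot-checked numerically. The
sketch below is illustrative only and is no substitute for the argument given
in the Appendix; it evaluates $\phi(y;N) = h(y;N) - y$ on a grid for a few odd
values of N (the grid and the values of N are arbitrary choices).

```python
from math import comb

def h(y, n):
    """h(y; N) of (5) for odd n, with m = (n + 1) / 2."""
    m = (n + 1) // 2
    return sum(comb(n, l) * y**l * (1 - y)**(n - l) for l in range(m, n + 1))

def phi(y, n):
    """Integrand of (17): phi(y; N) = h(y; N) - y."""
    return h(y, n) - y

for n in (3, 5, 7):
    for k in range(51):
        d = k / 100
        # Antisymmetry (18) about the center .5, up to rounding error.
        assert abs(phi(0.5 + d, n) + phi(0.5 - d, n)) < 1e-12
    # Sign pattern: below the axis on (0, .5), above it on (.5, 1).
    assert phi(0.25, n) < 0 < phi(0.75, n)

print(phi(0.25, 3), phi(0.75, 3))
```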

To describe this condition, we require that

$$G(.5 - a) - G(.5 - b) \ge G_-(.5 + b) - G_-(.5 + a) \tag{19}$$

for all $0 \le a < b$ where $G_-(y)$ is given by the left continuous version
of $G(y)$; namely, by

$$G_-(y) = \int_{\{x:\ \theta(x) < y\}} dQ. \tag{20}$$

Note that if equality holds in (19) for all $0 \le a < b$, then $G(y)$ is a
symmetric distribution with center of symmetry at .5. Thus condition (19)
describes an asymmetry of the intensity distribution relative to the center
point of [0, 1].

The asymmetry condition (19) can also be described by either of the
following conditions:

$$G(.5 - y) + G_-(.5 + y) \text{ is nonincreasing in } y \ge 0 \tag{21}$$

or

$$G(y) + G_-(1 - y) \text{ is nondecreasing in } y \le .5. \tag{22}$$

A sufficient condition under which redundant N-Version ($N = 1, 3, 5, \ldots$
and $m = (N+1)/2$) structures "on the average" have smaller probability of
failure than do single versions is as stated in the following:

Theorem 3. If the intensity distribution satisfies the asymmetry condition
(19), then $\int \phi(y;N)\,dG \le 0$. Equality holds when $G(y)$ is a
symmetric distribution.

Proof. Since $\phi(.5;N) = 0$,

$$\int_{-\infty}^{\infty} \phi(y;N)\,dG = \int_{-\infty}^{.5} \phi(y;N)\,dG + \int_{.5}^{\infty} \phi(y;N)\,dG_-$$

and by substitution, the expression on the right becomes

$$-\int_{0}^{\infty} \phi(.5 - y;N)\,dG(.5 - y) + \int_{0}^{\infty} \phi(.5 + y;N)\,dG_-(.5 + y).$$

Now using the antisymmetry of $\phi(y;N)$ gives

$$\int \phi(y;N)\,dG = \int_{0}^{\infty} \phi(.5 + y;N)\,d[G(.5 - y) + G_-(.5 + y)].$$

If $G(y)$ is symmetric then $G(.5 - y) + G_-(.5 + y)$ is constant in $y \ge 0$
so that $\int \phi(y;N)\,dG = 0$. On the other hand, if condition (19) holds
then (21) implies that $G(.5 - y) + G_-(.5 + y)$ assigns a negative measure to
each interval and implies the desired result.
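For a distribution with finitely many support points, condition (19) appears
to reduce to a comparison of point masses: the mass at each value above .5
must not exceed the mass at its mirror image below .5. The sketch below rests
on that reading and is illustrative only; the function names and the
probability mass functions are hypothetical. It checks the condition and
evaluates $\int \phi(y;N)\,dG = p_N - p$ for N = 3.

```python
from math import comb

def h(y, n):
    """h(y; N) of (5) for odd n, with m = (n + 1) / 2."""
    m = (n + 1) // 2
    return sum(comb(n, l) * y**l * (1 - y)**(n - l) for l in range(m, n + 1))

def satisfies_asymmetry(pmf, digits=12):
    """Discrete reading of (19): mass at .5 + d <= mass at .5 - d, d > 0.
    Support points are rounded so that mirrored values compare equal."""
    mass_at = {}
    for theta, mass in pmf.items():
        key = round(theta, digits)
        mass_at[key] = mass_at.get(key, 0.0) + mass
    return all(mass_at.get(round(1.0 - theta, digits), 0.0) >= mass
               for theta, mass in mass_at.items() if theta > 0.5)

def excess(pmf, n):
    """Integral of phi(y; N) dG, i.e. p_N - p, for a discrete distribution."""
    return sum(mass * (h(theta, n) - theta) for theta, mass in pmf.items())

# Hypothetical distributions {theta: mass}.
skewed_low = {0.0: 0.90, 0.4: 0.06, 0.6: 0.04}  # more mass at .4 than at .6
skewed_high = {0.0: 0.60, 0.9: 0.40}            # mass at .9, none at .1

print(satisfies_asymmetry(skewed_low), excess(skewed_low, 3))
print(satisfies_asymmetry(skewed_high), excess(skewed_high, 3))
```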

Although asymmetry of the intensity distribution is not a necessary
condition, it does describe a wide class of cases for which an N-Version
structure is better than a single version. In particular note that if
$1 - G(.5) = 0$, then the sufficient condition is met; that is, if
$\theta(x) \le .5$ for $x \in \Omega$ except on a set $A$ for which
$Q(A) = 0$, then an N-Version structure gives a smaller probability of failure
than does a strategy based on a single version.

Whereas for hardware devices the independence model and the average 
component failure probability, p, can be used to give a condition under which 
redundancy improves reliability, this is not true, in general, for redundant 
software subject to coincident errors. In particular, the average component 
failure probability being less than .5 does not imply that redundancy decreases 
system failure probability as is demonstrated in the next section. 

5.0 EFFECTS OF COINCIDENT ERRORS 

In this section we examine the effects of coincident errors on the failure
probability, $p_N$ ($N = 1, 3, 5, \ldots$), of an N-Version software structure.
Since coincidence, in the current context, refers to an intensity function
$\theta(x)$, $x \in \Omega$, we are confronted with the problem of having to
hypothesize a probability mass function (pmf), $g(\theta)$, of the type
suggested earlier in (7). We will assume a highly skewed distribution as in
Table 1a to represent a form we believe is reasonable to expect in applications
of software redundancy.

The interpretation of $g(\theta)$ is the probability of encountering an
$x \in \Omega$ whose coincidence intensity is the proportion $\theta$. Thus
ideally, we have high probabilities of encountering inputs that result in low
values of $\theta$ and significantly less probability of encountering the
higher intensity coefficients at the tail of the distribution. For the given
pmf, we would expect all (i.e., $\theta = 0$) of the programs of our population
to provide correct outputs on 98.98% of the input cases. The average failure
probability for a single version (which is the same as the mean of the
intensity distribution) is

$$p = \sum_{\theta} \theta\, g(\theta) = 2 \times 10^{-4}.$$
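The single-version mean can be checked directly from the mass points of Table 1a. The short Python sketch below does only that arithmetic; the variable names are ours.

```python
# Mass points theta -> g(theta) of the pmf in Table 1a.
TABLE_1A = {
    0.00: 0.98977, 0.01: 0.00512, 0.02: 0.00256, 0.03: 0.00128,
    0.04: 0.00064, 0.05: 0.00032, 0.06: 0.00016, 0.07: 0.00008,
    0.08: 0.00004, 0.09: 0.00002, 0.10: 0.00001,
}

total_mass = sum(TABLE_1A.values())
mean_intensity = sum(theta * g for theta, g in TABLE_1A.items())

print(f"total mass     = {total_mass:.5f}")
print(f"mean intensity = {mean_intensity:.4e}")
```

The mass points sum to one and the mean comes to about $2.04 \times 10^{-4}$, in agreement with the rounded value quoted above.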


                    (a)

      $\theta$      $g(\theta)$
      0             .98977
      .01           .00512
      .02           .00256
      .03           .00128
      .04           .00064
      .05           .00032
      .06           .00016
      .07           .00008
      .08           .00004
      .09           .00002
      .10           .00001

                    (b)

      $\theta$      $g_1(\theta)$     $g_2(\theta)$     $g_3(\theta)$
      0             .99999            .99997            .99993
      .05           .00001            .00002            .00004
      .10           0                 .00001            .00002
      .15           0                 0                 .00001

                    (c)

      $\theta$      $g(\theta)$
      0             .99899
      .10           .00100
      .50           .00001

                    (d)

      $\theta$      $g(\theta)$
      0             .99998
      .05           .00100
      .60           .00001

                    (e)

      $\theta$      $g(\theta)$
      0             .99999
      .60           .00001

                    (f)

      $\theta$      $g(\theta)$
      0             .99998
      .10           .00001
      .60           .00001

Table 1. - Probability mass functions for figures


Effect of Independence Assumption 


The expected system failure probability on the basis of the pmf of Table 1a
is shown in Figure 1. Also shown is the result of assuming independent errors.
It is evident that increasing $N$ does substantially reduce the probability of
incorrect output for an N-Version system. An $N = 5$ version system, for
example, will reduce this failure probability by approximately two orders of
magnitude relative to that of a single version. However, also evident is the
fact that the assumption of independent errors leads to predictions of
improvement of more than five orders of magnitude. This underestimation can be
seen another way: it would take seventeen versions from a population whose
average failure probability is $2 \times 10^{-4}$ to produce a system with
$p_N < 10^{-9}$ rather than the five versions when independence is assumed.
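The comparison can be reproduced numerically. The sketch below assumes that the system failure probability for a discrete pmf is $p_N = \sum_\theta h(\theta;N)\,g(\theta)$, with $h$ the binomial majority-failure probability, and that the independence model replaces this by $h(p;N)$ evaluated at the mean $p$. The function names are ours, not the paper's.

```python
from math import comb

TABLE_1A = {
    0.00: 0.98977, 0.01: 0.00512, 0.02: 0.00256, 0.03: 0.00128,
    0.04: 0.00064, 0.05: 0.00032, 0.06: 0.00016, 0.07: 0.00008,
    0.08: 0.00004, 0.09: 0.00002, 0.10: 0.00001,
}


def h(theta, n):
    """Probability that a majority of n versions fail at intensity theta."""
    m = (n + 1) // 2
    return sum(comb(n, i) * theta**i * (1 - theta) ** (n - i)
               for i in range(m, n + 1))


def p_system(pmf, n):
    """Expected failure probability of an N-Version system (coincident errors)."""
    return sum(h(theta, n) * g for theta, g in pmf.items())


def smallest_n(failure_prob, target, n_max=99):
    """Smallest odd N with failure_prob(N) < target, or None if none is found."""
    for n in range(1, n_max + 1, 2):
        if failure_prob(n) < target:
            return n
    return None


p_mean = sum(theta * g for theta, g in TABLE_1A.items())
n_coincident = smallest_n(lambda n: p_system(TABLE_1A, n), 1e-9)
n_independent = smallest_n(lambda n: h(p_mean, n), 1e-9)

print("N = 5, coincident model :", p_system(TABLE_1A, 5))
print("N = 5, independence     :", h(p_mean, 5))
print("versions needed for 1e-9:", n_coincident, "vs", n_independent)
```

Under these assumptions the search returns seventeen versions for the coincident-error model and five under independence, matching the figures quoted in the text.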
Figure 1. - Effect of independent errors assumption.


Effect of Shifted Intensity Distribution 

Figure 2 shows the effect of shifting the mass points of the intensity
distribution to the right, thereby increasing the intensity of coincident
errors. The coincident errors increase from a maximum of 5.0 percent for
$g_1(\theta)$ to 15.0 percent for $g_3(\theta)$ as shown in Table 1b. This
shift has degraded average component failure probability, $p$, from
$5.0 \times 10^{-7}$ to $5.5 \times 10^{-6}$. If these components were used in
a critical application requiring $p_N < 10^{-9}$ then twenty-one components
would be required from the population with $g_3(\theta)$ compared to nine
components corresponding to $g_1(\theta)$.
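A minimal Python sketch of this comparison follows, again assuming $p_N = \sum_\theta h(\theta;N)\,g(\theta)$ with $h$ the binomial majority-failure probability. The mass points are those of Table 1b; the helper names are hypothetical.

```python
from math import comb

# Mass points theta -> g(theta) of the three pmfs in Table 1b.
G_1 = {0.0: 0.99999, 0.05: 0.00001}
G_2 = {0.0: 0.99997, 0.05: 0.00002, 0.10: 0.00001}
G_3 = {0.0: 0.99993, 0.05: 0.00004, 0.10: 0.00002, 0.15: 0.00001}


def h(theta, n):
    """Probability that a majority of n versions fail at intensity theta."""
    m = (n + 1) // 2
    return sum(comb(n, i) * theta**i * (1 - theta) ** (n - i)
               for i in range(m, n + 1))


def mean(pmf):
    """Average single-version failure probability."""
    return sum(theta * g for theta, g in pmf.items())


def p_system(pmf, n):
    """Expected failure probability of an N-Version system."""
    return sum(h(theta, n) * g for theta, g in pmf.items())


def versions_needed(pmf, target, n_max=99):
    """Smallest odd N with p_N < target, or None if none is found."""
    for n in range(1, n_max + 1, 2):
        if p_system(pmf, n) < target:
            return n
    return None


for name, pmf in (("g_1", G_1), ("g_2", G_2), ("g_3", G_3)):
    print(name, mean(pmf), versions_needed(pmf, 1e-9))
```

For $g_1$ and $g_3$ the sketch gives means of $5.0 \times 10^{-7}$ and $5.5 \times 10^{-6}$ and requirements of nine and twenty-one versions, respectively.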



Figure 2. - Effect of a shifted intensity distribution.


The Limiting Probability of System Failure 

Here we examine the limiting value of $p_N$ as $N$ increases. Using property
(ii) of the Appendix it is easily shown that this limiting value is

$$\lim_{N \to \infty} p_N = .5\,[G(.5+) - G(.5-)] + \int_{.5+}^{1} dG(\theta).$$        (23)

This effect is illustrated in Figure 3 using the pmf of Table 1c.
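For a discrete pmf the limit (23) reduces to half the mass at $\theta = .5$ plus all the mass above .5. The Python sketch below evaluates it for Table 1c and compares it with $p_N$ for a large $N$; it assumes $p_N = \sum_\theta h(\theta;N)\,g(\theta)$ with $h$ the binomial majority-failure probability, and the function names are ours.

```python
from math import comb

# Mass points theta -> g(theta) of the pmf in Table 1c.
TABLE_1C = {0.0: 0.99899, 0.10: 0.00100, 0.50: 0.00001}


def h(theta, n):
    """Probability that a majority of n versions fail at intensity theta."""
    m = (n + 1) // 2
    return sum(comb(n, i) * theta**i * (1 - theta) ** (n - i)
               for i in range(m, n + 1))


def p_system(pmf, n):
    """Expected failure probability of an N-Version system."""
    return sum(h(theta, n) * g for theta, g in pmf.items())


def limiting_failure_probability(pmf):
    """Discrete form of the limit: .5 * mass at .5 plus the mass above .5."""
    at_half = sum(g for theta, g in pmf.items() if theta == 0.5)
    above_half = sum(g for theta, g in pmf.items() if theta > 0.5)
    return 0.5 * at_half + above_half


limit = limiting_failure_probability(TABLE_1C)
print("limit    :", limit)
for n in (1, 3, 9, 21, 101):
    print(f"p_{n:<3}    :", p_system(TABLE_1C, n))
```

The printed values decrease toward the limit, which for this pmf is set entirely by the mass point at $\theta = .5$.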



Figure 3. - Limit on Pr {System Failure}.

Although it is true for this example that a fault-tolerant approach is
better than a single version of software, the coincidence mass points
distributed along the interval $.5 \le \theta \le 1$ limit the reliability that
can be obtained with fault tolerance. For this example $p_N$ can never fall
below $5 \times 10^{-6}$ with any degree of fault tolerance.


A Condition For System Degradation In The Limit. 

Consider the pmf of Table 1d and the corresponding $p_N$ shown in Figure 4.
Here we have a case where the value of $N$ corresponding to the minimum failure
probability is not the limiting case ($N \to \infty$) but rather an
intermediate value, $N = 7$. Increasing $N$ beyond this point actually degrades
the system. What condition has brought about this degradation with increasing
$N$?



Figure 4. - Existence of optimal N.

This condition will exist when the failure probability for some N-Version
system is less than the limiting failure probability, i.e., when for some $N$,

$$p_N < .5\,[G(.5+) - G(.5-)] + \int_{.5+}^{\infty} dG(\theta).$$        (24)

Using (8) for $p_N$ this can be written as

$$\int_{-\infty}^{.5-} h(\theta;N)\,dG(\theta) + \int_{.5+}^{\infty} [h(\theta;N) - 1]\,dG_{-}(\theta) < 0.$$        (25)

Using the symmetry, $h(\theta;N) = 1 - h(1 - \theta;N)$, we have

$$\int_{-\infty}^{.5-} h(\theta;N)\,d[G(\theta) + G_{-}(1 - \theta)] < 0.$$        (26)

The sufficiency condition of Theorem (2) implies that
$G(\theta) + G_{-}(1 - \theta)$ is nondecreasing for $\theta \le .5$, which is
inconsistent with inequality (26) above because $h(\theta;N) \ge 0$.
Therefore, a necessary condition for a system to degrade in the limit is a
violation of the sufficiency condition of Theorem (2).

This example illustrates the possibility of coincident errors causing an 
increase in system failure probability with increasing N. However, the end 
result is still better than a single version system. Also note that the
sufficiency condition given in Theorem 2 is not a necessary condition for
$p_N < p$.
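The behaviour described for Table 1d can be checked with a short Python sketch. It assumes $p_N = \sum_\theta h(\theta;N)\,g(\theta)$ with $h$ the binomial majority-failure probability, and it lists only the mass points with $\theta > 0$, since $h(0;N) = 0$ means the point at $\theta = 0$ contributes nothing to $p_N$. The names are hypothetical.

```python
from math import comb

# Nonzero-intensity mass points theta -> g(theta) of Table 1d.
TABLE_1D = {0.05: 0.00100, 0.60: 0.00001}


def h(theta, n):
    """Probability that a majority of n versions fail at intensity theta."""
    m = (n + 1) // 2
    return sum(comb(n, i) * theta**i * (1 - theta) ** (n - i)
               for i in range(m, n + 1))


def p_system(pmf, n):
    """Expected failure probability of an N-Version system."""
    return sum(h(theta, n) * g for theta, g in pmf.items())


odd_ns = range(1, 42, 2)
best_n = min(odd_ns, key=lambda n: p_system(TABLE_1D, n))

# Limiting value: no mass at .5 here, so only the mass above .5 remains.
limit = sum(g for theta, g in TABLE_1D.items() if theta > 0.5)

print("optimal N       :", best_n)
print("p at optimal N  :", p_system(TABLE_1D, best_n))
print("limiting value  :", limit)
print("single version  :", p_system(TABLE_1D, 1))
```

Over the range searched, the minimum falls at $N = 7$, where $p_N$ lies below the limiting value, so condition (24) holds; the limiting value is itself still below the single-version failure probability.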

Effect of Highly Coincident Errors 

As we have shown earlier, certain intensity functions can result in an
N-Version system being more prone to failure than a single software component.

An example of this, although perhaps highly unlikely, is shown in Figure 5a
(corresponding to the pmf of Table 1e). Here all programs produce correct
output except for a subset $A$ of the input space for which
$\theta(x) = \theta = .6$, $x \in A$. Thus for this subset, 60 percent of the
population would produce an error. In this case it is clear why increasing $N$
degrades system reliability. In the case of the independence model, if the
average component failure probability, $p$, exceeds .5, it becomes increasingly
more difficult with increasing $N$ to realize a majority of components having
correct output. Similarly, for the coincident error model, if
$\theta(x) > .5$ for $x$ in some subset $A$ for which $Q(A) > 0$, it also
becomes increasingly more difficult with increasing $N$ to realize a majority
of components having correct output. Moreover, conditions could exist when one
must specify a value of $N$ in order to assess whether N-Version is better than
a single version. This is illustrated in Figure 5b (corresponding to the pmf of
Table 1f). Increasing $N$ initially decreases system failure probability, which
then eventually heads for its limiting value, which is worse than for a single
component.
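Both cases can be illustrated with the following Python sketch, which assumes $p_N = \sum_\theta h(\theta;N)\,g(\theta)$ with $h$ the binomial majority-failure probability. Only the mass points with $\theta > 0$ of Tables 1e and 1f are listed, since the point at $\theta = 0$ does not contribute to $p_N$; the names are ours.

```python
from math import comb

# Nonzero-intensity mass points theta -> g(theta) of Tables 1e and 1f.
TABLE_1E = {0.60: 0.00001}
TABLE_1F = {0.10: 0.00001, 0.60: 0.00001}


def h(theta, n):
    """Probability that a majority of n versions fail at intensity theta."""
    m = (n + 1) // 2
    return sum(comb(n, i) * theta**i * (1 - theta) ** (n - i)
               for i in range(m, n + 1))


def p_system(pmf, n):
    """Expected failure probability of an N-Version system."""
    return sum(h(theta, n) * g for theta, g in pmf.items())


for name, pmf in (("Table 1e", TABLE_1E), ("Table 1f", TABLE_1F)):
    limit = sum(g for theta, g in pmf.items() if theta > 0.5)
    values = {n: p_system(pmf, n) for n in (1, 3, 5, 7, 9)}
    print(name, values, "limit:", limit)
```

For Table 1e every $N > 1$ is worse than a single version. For Table 1f, $N = 3$ improves on a single version, but by $N = 7$ the system is already worse, and the limiting value exceeds the single-version failure probability.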
Figure 5. - Effect of highly coincident errors.

6.0 CONCLUSIONS


The application of redundancy to hardware components has long been 
established as an effective methodology for increasing reliability. Its 
application to software is a relatively new and untested technology largely 
motivated by the need for high reliability in life-critical applications such 
as flight control. Thus, at least in the initial stage of studying fault- 
tolerant software, much interest is likely to lie in evaluating the long term 
effectiveness of a fault-tolerant strategy rather than in examining only a 
single instance in which, for example, a particular system has smaller failure 
probability than its component versions. 

In this paper a theoretical basis for the analysis of redundant software 
has been developed which directly links certain basic quantities with the 
experimental process of testing independently designed software components. We 
used this model to study in some detail the case of N-Version redundancy in 
which the system fails if at least a majority of its components fail. Our main 
conclusion in this case is that if the intensity distribution is asymmetric in 
a certain way (see Section 4), then we can ensure that an N-Version strategy is 
better than one based on using a single software component. 

This condition differs sharply from what is required on the basis of the 
independence model commonly used to estimate the reliability of hardware 
devices. In the latter case, a necessary and sufficient condition (assuming an 
N-modular redundant system which fails if a majority of its components fail) 
for redundancy to improve reliability over that of a single component is that
the component failure probability be less than .5 and, further, system 
reliability would then increase as the number of components is increased. The 
same thing cannot be said of redundant software systems which are subject to 
coincident errors (see Section 5). 

This only points out one major difference between the type of model needed 
for redundant software and the independence model used for hardware devices. 

Our model also gives some insight concerning the validity of assuming that 
software components fail independently in a statistical sense. A low 
coincidence of errors does not describe independence. Rather a constant 
intensity characterizes the case of independence and the variance of the 
intensity distribution measures departure from the independence model. We 
believe a constant intensity is a condition unlikely to hold in most 
applications. Therefore, the combinatorial method, based on independence and 
requiring only information concerning the failure probability of component 
versions, is unlikely to give accurate estimates when applied to redundant 
software systems. 

We have illustrated the effects of coincident errors on the failure 
probability of redundant software systems. It is clear that redundancy under 
certain conditions can improve reliability. However, the effects of coincident
errors, as a minimum, require an increase in the number of software components
greater than would be predicted by calculations using the combinatorial method
which assumes independence. Further, the effects of a high intensity of
coincident errors can be much more serious, to the extent of making a
fault-tolerant approach, on average, worse than using a single version. Here
again we must reassert that the assumption we are making is that we equate the
process of developing a single version with that of randomly selecting a
program from a population of programs which have been independently developed.

For purposes of illustration we have postulated in some cases a rather high 
intensity of coincident errors. It is clear we need empirical data to truly 
assess the effects of these errors on highly reliable software systems. 
Additionally, efforts to identify the sources of coincident errors and to 
develop methods to reduce their intensity (hopefully that will come with an 
understanding of the common source of the errors) will not only benefit the 
development of fault-tolerant software but also software engineering in 
general. 
APPENDIX


Here we summarize some properties of $h(y;N)$. A real valued function $f(y)$,
say, is antisymmetric [13] on [0, 1] with center at .5 if

$$f(.5 - y) + f(y + .5) = 2f(.5), \quad 0 \le y \le .5.$$        (A.1)

The function $h(y;N)$ given by (5) for $N = 1, 3, 5, \ldots$ and $m = (N+1)/2$
can be written

$$h(y;N) = N!\,(k!)^{-2} \int_{0}^{y} u^{k} (1-u)^{k}\,du, \quad 0 \le y \le 1,$$        (A.2)

where $k = (N-1)/2$; this is a well-known formula [14] for a sum of binomial
terms.
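The identity between the binomial sum and the integral form (A.2) can be checked numerically. The Python sketch below assumes that (5) is the binomial probability of at least $m$ failures among $N$ versions and approximates the integral with composite Simpson's rule; the function names and the step count are our own choices.

```python
from math import comb, factorial


def h_sum(y, n):
    """Binomial form: probability of at least m = (n+1)/2 failures out of n."""
    m = (n + 1) // 2
    return sum(comb(n, i) * y**i * (1 - y) ** (n - i) for i in range(m, n + 1))


def h_integral(y, n, steps=2000):
    """Integral form (A.2), evaluated with composite Simpson's rule (even steps)."""
    k = (n - 1) // 2
    const = factorial(n) / factorial(k) ** 2

    def integrand(u):
        return u**k * (1 - u) ** k

    width = y / steps
    total = integrand(0.0) + integrand(y)
    for j in range(1, steps):
        weight = 4 if j % 2 else 2
        total += weight * integrand(j * width)
    return const * total * width / 3


for n in (1, 3, 5, 7):
    for y in (0.1, 0.5, 0.9):
        print(n, y, h_sum(y, n), h_integral(y, n))
```

The two columns agree to well within the accuracy of the quadrature for the values of $N$ and $y$ tried.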

The main properties of interest concerning $h(y;N)$ and
$\phi(y;N) = h(y;N) - y$ are:

(i) $h(0;N) = 0$, $h(1;N) = 1$ and $h(.5;N) = .5$ for $N = 1, 3, 5, \ldots$;

(ii) as $N \to \infty$, $\lim h(y;N) = 0, .5, 1$ whenever $y < .5$, $y = .5$,
and $y > .5$, respectively;

(iii) $h(y;N)$ is antisymmetrical with center at .5;

(iv) $h(y;N)$ is convex on [0, .5] and is concave on [.5, 1];

(v) $\phi(y;N)$ is antisymmetrical with center at .5 and
$\phi(0;N) = \phi(.5;N) = \phi(1;N) = 0$ for $N = 1, 3, 5, \ldots$;

(vi) $\phi(y;N)$ is convex on [0, .5] and is concave on [.5, 1];

(vii) $h(y;N)$ is nonincreasing in $N = 1, 3, 5, \ldots$ for $y < .5$;

(viii) $h(y;N)$ is nondecreasing in $N = 1, 3, 5, \ldots$ for $y > .5$.
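Several of these properties are easy to spot-check on a grid. The Python sketch below assumes that $h(y;N)$ is the binomial probability of at least $(N+1)/2$ failures among $N$ versions; the grid, tolerance, and variable names are hypothetical choices, and a finite check of this kind illustrates the properties rather than proving them.

```python
from math import comb


def h(y, n):
    """Probability of at least m = (n+1)/2 failures among n versions."""
    m = (n + 1) // 2
    return sum(comb(n, i) * y**i * (1 - y) ** (n - i) for i in range(m, n + 1))


TOL = 1e-12
odd_ns = range(1, 16, 2)
grid = [j / 20 for j in range(21)]  # 0, .05, ..., 1

# (i) values at the end points and at the center.
prop_i = all(abs(h(0.0, n)) < TOL and abs(h(1.0, n) - 1) < TOL
             and abs(h(0.5, n) - 0.5) < TOL for n in odd_ns)

# (ii) behaviour for large N on either side of .5.
prop_ii = h(0.4, 401) < 1e-3 and h(0.6, 401) > 1 - 1e-3

# (iii) antisymmetry about .5: h(.5 - d) + h(.5 + d) = 1 = 2 h(.5).
prop_iii = all(abs(h(0.5 - d, n) + h(0.5 + d, n) - 1) < TOL
               for n in odd_ns for d in (0.1, 0.25, 0.4))

# (vii) and (viii): monotonicity in N below and above .5.
prop_vii = all(h(y, n + 2) <= h(y, n) + TOL
               for n in odd_ns for y in grid if y < 0.5)
prop_viii = all(h(y, n + 2) >= h(y, n) - TOL
                for n in odd_ns for y in grid if y > 0.5)

print(prop_i, prop_ii, prop_iii, prop_vii, prop_viii)
```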
Proof. The result (i) follows by substitution and by symmetry of the binomial
distribution when $y = .5$; (ii) follows from the weak law of large numbers
applied to the binomial distribution; (iv) and (vi) can be seen directly by
examining the second derivatives of $h(y;N)$ and $\phi(y;N)$.

To prove (iii), note that symmetry of the integrand in (A.2) gives

$$N!\,(k!)^{-2} \int_{0}^{.5+y} u^{k} (1-u)^{k}\,du = N!\,(k!)^{-2} \int_{.5-y}^{1} u^{k} (1-u)^{k}\,du$$

where the term on the left is $h(.5+y;N)$ and the term on the right is
$1 - h(.5-y;N)$. Therefore, $h(.5+y;N) + h(.5-y;N) = 1 = 2h(.5;N)$. Now (v)
also follows by using the antisymmetry of $h(y;N)$ established in (iii).

To prove (vii), let $f(y) = h(y;N+2)/h(y;N)$ and use (A.2) to get

$$f(y) = c \int_{0}^{y} u^{k+1} (1-u)^{k+1}\,du \Big/ \int_{0}^{y} u^{k} (1-u)^{k}\,du$$

where $c = (N+2)(N+1)(k+1)^{-2}$, $k = 0, 1, 2, \ldots$. The derivative
$\partial f(y)/\partial y$ is nonnegative when $y < .5$ providing

$$y(1-y) \int_{0}^{y} u^{k} (1-u)^{k}\,du - \int_{0}^{y} u^{k+1} (1-u)^{k+1}\,du \ge 0.$$

But $u(1-u)$ when $0 \le u \le y < .5$ takes the maximum value $y(1-y)$ so that

$$\int_{0}^{y} u(1-u)\,u^{k} (1-u)^{k}\,du \le y(1-y) \int_{0}^{y} u^{k} (1-u)^{k}\,du$$

which proves that $f(y)$ is nondecreasing for $0 \le y \le .5$. This proves
(vii) since $f(.5) = 1$. Since $h(.5+y;N) = 1 - h(.5-y;N)$ and $h(.5-y;N)$ is
nonincreasing in $N = 1, 3, 5, \ldots$, we have also proved (viii).
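The key step of this argument, that $f(y) = h(y;N+2)/h(y;N)$ is nondecreasing on $(0, .5]$ with $f(.5) = 1$, can be spot-checked numerically. The Python sketch below assumes the binomial form of $h(y;N)$; the grid and helper names are ours, and the check is illustrative only.

```python
from math import comb


def h(y, n):
    """Probability of at least m = (n+1)/2 failures among n versions."""
    m = (n + 1) // 2
    return sum(comb(n, i) * y**i * (1 - y) ** (n - i) for i in range(m, n + 1))


def ratio(y, n):
    """f(y) = h(y; N+2) / h(y; N), defined here for 0 < y <= .5."""
    return h(y, n + 2) / h(y, n)


def is_nondecreasing(values, tol=1e-12):
    """True if each value is at least the previous one, up to rounding."""
    return all(a <= b + tol for a, b in zip(values, values[1:]))


grid = [j / 100 for j in range(1, 51)]  # .01, .02, ..., .50

for n in (1, 3, 5, 7):
    values = [ratio(y, n) for y in grid]
    print(n, is_nondecreasing(values), values[-1])
```

Because $f$ rises to 1 at $y = .5$, it is at most 1 for $y < .5$, which is the statement $h(y;N+2) \le h(y;N)$ of property (vii).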